Abstract

The terms adaptation, learning, concept-formation, induction, self-organization, and
self-repair have all been used in the context of learning system (LS) research. In this
article, three distinct approaches to machine learning and adaptation are considered: (i) the
adaptive control approach, (ii) the pattern recognition approach, and (iii) the artificial
intelligence approach.

Progress in each of these areas is summarized in the first part of the article. In the
next part a general model for learning systems is presented that allows characterization and
comparison of individual algorithms and programs in all of these areas. The model details the
functional components felt to be essential for any learning system, independent of the
techniques used for its construction and the specific environment in which it operates.
Specific examples of learning systems are described in terms of the model.


1 To appear in J. Belzer (Ed.), Encyclopedia of Computer Science and Technology, Marcel
Dekker, Inc., New York, 1978, Vol. 11. This research has been supported in part by the
National Institutes of Health Grant No. 6R24 RR 00612-00; Advanced Research Projects
Agency Grant No. MDA 003-77-C-02777; and the Department of National Defence of
Canada.


1 Introduction

Giving a machine the ability to learn, adapt, organize, or repair itself is among the
oldest and most ambitious goals of computer science. In the early days of computing, these
goals were central to the new discipline called cybernetics [Wiener, 1946], [Ashby, 1966].
Over the past two decades, progress toward these goals has come from a variety of fields,
notably computer science, psychology, adaptive control theory, pattern recognition, and
philosophy. Substantial progress has been made in developing techniques for machine
learning in highly restricted environments. Computer programs have been written that can
learn to play good checkers [Samuel, 1963], [Samuel, 1967], learn to filter out the strong
heartbeat of a mother in order to pick out the weaker heartbeat of the fetus [Widrow,
1973], or learn to predict the mass spectra of complex molecules [Buchanan, 1978]. Each
of these programs, however, is tailored to its particular task, taking advantage of particular
assumptions and characteristics associated with its domain. The search for efficient,
powerful, and general methods for machine learning has come only a short way.

The terms adaptation, learning, concept-formation, induction, self-organization, and
self-repair have all been used in the context of learning system (LS) research. The
research has been conducted within many different scientific communities, however, and
these terms have come to have a variety of meanings. It is therefore often difficult to
recognize that problems that are described differently may in fact be identical. Learning
system models, as well, are often tuned to the requirements of a particular discipline and are
not suitable for application in related disciplines.

The term learning system is very broad, and often misleading. In the context of this
article, a learning system is considered to be any system that uses information obtained
during one interaction with its environment to improve its performance during future
interactions. This rough characterization may include man/machine systems (see [McCarthy,
1968]) in which humans take on active roles as required functional components. In some
systems there is continuous interaction with the environment, with feedback and subsequent
improvement. In other systems there is a sharp distinction between the interactions that
constitute training and subsequent performance or predictions with no further training.
Another way of differentiating between various learning systems is on the basis of what
kinds of alterations they perform.

Figure 1 shows several classes of systems that fit the above characterization and lists
the kinds of alterations that they perform. Data base systems are among the earliest kinds
of systems that fit our definition. Such systems represent information about their
environment by sets of alterable assertions. In the late 1950's and early 1960's, adaptive
control techniques were first used to build programs that alter parameters in equations that
model some aspect of the external world [Samuel, 1963], [Widrow, 1973]. The
perceptrons of the early 1960's [Minsky, 1972], [Rosenblatt, 1966] represent an attempt
to use adaptive control techniques to train recognition networks by altering weighting
parameters. More recently, concept formation (and other) systems have been written that
build and alter structural representations as their model of the external world. In short, an
important difference to be noted in LSs is their internal representations of the outer
environment: some are mathematical models, some are linguistic assertions, and still others
are structures encoding symbolic relations.

Figure 1. A Spectrum of Learning Systems.

In this article, three distinct approaches to machine learning and adaptation are
considered: (i) the adaptive control approach, (ii) the pattern recognition approach, and (iii)
the artificial intelligence approach.

Progress  in  each  of  these  areas  is  summarized  in  the  first  part  of  the  article.  In  the 
next  part  a general  model  for  learning  systems  is  presented  that  allows  characterization  and 
comparison  of  individual  algorithms  and  programs  in  all  of  these  areas.  Specific  examples  of 
learning systems are described in terms of the model.


2 Adaptive  System  Approach  to  Learning 

In the control literature, learning is generally assumed to be synonymous with
adaptation. It is often viewed as estimation or successive approximation of the unknown
parameters of a mathematical structure that has been chosen by the LS designer to
represent the system under study [Donalson, 1965], [Fu, 1970]. Once this has been done,
control techniques known to be suitable for the particular chosen structure can be applied.
Thus the emphasis has been on parameter learning, and the achievement of stable, reliable
performance [Sklansky, 1964]. Problems are commonly formulated in stochastic terms, and
the use of statistical procedures to achieve optimal performance with respect to some
performance criterion such as mean square error is standard [Wittenmark, 1976].
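
As an illustration of parameter learning by successive approximation, the following minimal
Python sketch adjusts the weights of a linear structure so as to reduce the squared error
between predicted and observed outputs. It uses the least-mean-squares (LMS) update, one
common choice among many; the function name lms_estimate, the step size, and the two-parameter
"plant" are assumptions made for this example and are not taken from the cited work.

```python
import random

def lms_estimate(samples, n_params, step=0.05, epochs=50):
    """Estimate weights w so that sum(w[i] * x[i]) approximates y (LMS update)."""
    w = [0.0] * n_params
    for _ in range(epochs):
        for x, y in samples:
            # Prediction error of the current parameter estimate on this sample.
            error = y - sum(wi * xi for wi, xi in zip(w, x))
            # Move each weight a small step in the direction that reduces error**2.
            w = [wi + step * error * xi for wi, xi in zip(w, x)]
    return w

# Hypothetical system under study: y = 2.0*x0 - 1.0*x1, observed without noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))
    data.append((x, 2.0 * x[0] - 1.0 * x[1]))

w = lms_estimate(data, n_params=2)
print(w)  # the estimates approach the true parameters 2.0 and -1.0
```

The structure (a weighted sum) is fixed in advance by the designer; only its parameters are
learned, which is the sense of learning emphasized in the control literature.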

There are many overlapping and sometimes contradictory definitions of the terms
related to adaptive systems. The following set, formulated by Glorioso [Glorioso, 1976],
serves to illustrate the main features. An adaptive system is defined as a system that
responds acceptably with respect to some performance criterion in the face of changes in
the environment or its own internal structure. A learning system is an adaptive system that
responds acceptably within some time interval following a change in its environment, and a
self-repairing system is one that responds acceptably within some time interval following a
change in its internal structure. Finally, a self-organizing system is an adaptive or learning
system in which the initial state is unknown, random, or unimportant.

Adaptive control is an outgrowth of automatic control that has attracted significant
research effort since the mid-1950's [Asher, 1976]. These investigations have been
motivated by a desire for development of real-time control of incompletely known systems or
plants. Limited plant specification is normally assumed to entail unknown, drifting parameters
in a prescribed mathematical description. Various methods of adaptive control have been
implemented for control of aerospace and industrial processes, as well as man-machine and
socioeconomic systems.

Adaptive controllers have been coarsely divided into two large classes, those of active and
passive adaptivity [Tse, 1973]. Active adaptive controllers are based on dual control
theory [Fel'dbaum, 1966]. In addition to the available real-time information, they utilize the
knowledge that future observations will be made that will provide further possible
performance evaluation, and regulate their learning accordingly. Passive adaptive
controllers utilize the available real-time measurements but ignore the availability of future
observations. This limitation results in much simpler adaptive algorithms. Thus passive
techniques have been much more extensively investigated.


2.1  Passive  Controllers 

Passive  adaptive  controllers  can  be  subdivided  into  two  classes:  indirect  and  direct, 
denoting  the  primary  focus  of  the  adaptation  mechanism  either  on  plant  parameter 
determination  or  control  parameter  determination,  respectively. 

Indirect adaptive control, originally suggested in [Kalman, 1966], arbitrarily
separates the control task into plant identification and calculation of the control law from the
plant parameter estimates. This approach was designed to utilize the existing arsenal of control
techniques requiring exact specification of the plant. Acceptance of this method has led to
considerable interest in system identification [Astrom, 1971]. Most parameter estimation
schemes, however, are inherently open loop and suffer consistency and identifiability
constraints when encompassed by feedback. This limitation can be circumvented by the
injection of a perturbation input [Saridis, 1976].
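
The two-step structure of indirect adaptive control can be sketched as follows. This is a
minimal illustration under stated assumptions, not a method from the cited papers: the plant is
taken to be a first-order discrete-time system y[k+1] = a*y[k] + b*u[k], identification is done
by batch least squares on open-loop data driven by a random perturbation input, and the control
law is a simple pole-placement gain. The names identify and feedback_gain and all numeric
values are hypothetical.

```python
import random

def identify(ys, us):
    """Least-squares estimate of (a, b) in y[k+1] = a*y[k] + b*u[k]."""
    prev, nxt = ys[:-1], ys[1:]
    syy = sum(y * y for y in prev)
    syu = sum(y * u for y, u in zip(prev, us))
    suu = sum(u * u for u in us)
    sny = sum(n * y for n, y in zip(nxt, prev))
    snu = sum(n * u for n, u in zip(nxt, us))
    det = syy * suu - syu * syu  # nonzero only if the input excites the plant
    a_hat = (sny * suu - snu * syu) / det
    b_hat = (snu * syy - sny * syu) / det
    return a_hat, b_hat

def feedback_gain(a_hat, b_hat, desired_pole):
    """Gain k for u = -k*y that places the estimated closed-loop pole."""
    return (a_hat - desired_pole) / b_hat

# Hypothetical plant, unknown to the controller: a = 0.9, b = 0.5.
rng = random.Random(1)
us = [rng.uniform(-1.0, 1.0) for _ in range(50)]  # perturbation input
ys = [0.0]
for u in us:
    ys.append(0.9 * ys[-1] + 0.5 * u)

a_hat, b_hat = identify(ys, us)                     # step 1: plant identification
k = feedback_gain(a_hat, b_hat, desired_pole=0.5)   # step 2: control law calculation
print(a_hat, b_hat, k)
```

The control law is computed as if the estimates were exact, which is why the quality of the
identification step matters so much in this approach.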

The alternative, which avoids the necessity of proper plant identification, is
direct adaptive control, in which the available control parameters themselves are adjusted
in order to improve the overall performance of the control system. Two broad techniques
exist for establishment of convergent control parameter adaptation schemes: search
methods and stability analysis. Search techniques generally suffer from merely local
convergence, whether based on gradient [Hasdorff, 1976] or heuristic [Fu, 1970] methods.
Alternatively, adaptive control algorithms arising from stability analysis can guarantee global
asymptotic stability as a by-product. The widest application of stability theory to adaptive
control design has utilized Liapunov's second method [Lindorff, 1973]. The earliest
application of Liapunov function synthesis for designing adaptive loops [Shackcloth, 1966]
utilized a model reference approach.

Model reference adaptive control techniques (see example in appendix) implement
adjustment of reachable parameters in the overall controlled system so that its response to
some reference signal exactly matches the response of a predetermined model to the same
reference. Such a structural arrangement in general requires the ability to adjust each
parameter independently in the overall controlled system. Assumption of this capability
hampers the current sophisticated schemes of adapting feedforward and feedback
parameters solely from plant input and output measurements [Landau, 1974a], [Monopoli,
1974] by occasionally necessitating an unbounded control effort. Control effort
boundedness is encouraged by abandoning exact output matching for input matching
[Johnson, 1970], which requires nonparametric, a posteriori determination of the optimal
input.


No single adaptive control approach mentioned is without limitations in attempting to
provide adequate control of a plant known only to be describable within a general structural
class. The primary focus of adaptive control on parameter selection has led to provably
convergent single level schemes. The ongoing merger of heuristic, layerable learning system
concepts (as described below) with these convergent parameter adjustment algorithms of
restricted applicability should improve the efficacy of adaptive control.


3 Pattern  Recognition  Approach  to  Learning 

Pattern recognition techniques are primarily employed at the interface of intelligent
agents and the real world of physical measurements and processes. The interface attempts
to provide some sensory capability to the agent, such as vision, touch, or some other
non-human sensory modality. In this context, a pattern may be an image, a spoken word, a radar
return from an aircraft, or whatever is appropriate to describe or classify a physical
environment that is viewed through a particular set of sensors.

The problem of pattern recognition is often viewed as the development of a set of
rules that can be used to assign observed patterns to particular known classes by
examination of a set of patterns of known class membership. There are, however, a variety
of related problems that can be discussed in the same framework. These include
pattern classification, in which the classification rules are known and the problem is simply
assignment of patterns to classes; pattern formation, in which the classes themselves must
be defined; and pattern description, in which the problem is to form descriptions (which are
often symbolic in form) of the observed patterns rather than assign them to classes.

The major concerns in pattern recognition are:

convergence: the learning system should eventually settle on a stable set of rules,
classes, or descriptions.

optimality: the objective is minimization of some cost functional, such as the average
risk associated with classification.

computational complexity: the objective is minimization of the difficulty of using an
algorithm, measured in terms of computation time, memory requirements, or programming
complexity.


3.1  Pattern  Recognition  Subclasses 

Pattern recognition is presently characterized by two major approaches: the statistical
decision-theoretic or discriminant approach, which employs a classification model, and the
linguistic (syntactic), or structural approach, which employs a description model. The first
approach has been more extensively studied, and a modestly large body of theory has been
constructed, whereas the second approach is relatively new, and many unsolved problems
remain.

The decision-theoretic approach commonly involves the extraction of a set of
characteristic (typically low-level) measurements, or features, from a set of patterns. Each
pattern is thus represented as a feature vector in a feature space, and the task of the
pattern recognition device is to partition the feature space in such a way as to classify the
individual patterns. Features, then, are usually chosen so that the distance (on some suitable
metric) between patterns in the feature space is maximized [Roche, 1974]. This approach
has been successful for applications such as communication of a known set of signal
waveforms corrupted by some form of distortion, such as noise or multipath interference.
However, it has been criticized because it is concerned only with statistical relationships
between features, and tends to ignore other structural relationships that may characterize
patterns [Kanal, 1974].

The linguistic, or structural approach has been developed in part to correct some of the
difficulties seen in the decision-theoretic approach. With this paradigm, patterns are viewed
as compositions of components, called subpatterns, or pattern primitives, that are typically
higher-level objects than the features of the decision-theoretic model. Patterns are often
viewed as sentences in a language defined by a formal grammar (sometimes called a pattern
grammar). Segmentation of patterns into primitives and formation of structural descriptions
are thus the primary issues. This approach embodies an attempt to use other sources of
information as aids to pattern recognition (e.g., in a speech understanding system [Reddy,
1973], [Erman, 1973], [Lesser, 1976], [Rovner, 1976], [Reddy, 1976], syntax,
semantics, and context act as powerful sources of information in addition to the recorded
information).

In that both parametric and structural techniques are applied, pattern recognition
forms a bridge between the adaptive systems and artificial intelligence approaches to
learning system design. We have recently begun to see a merger of the two approaches
(see, for example, [Stockman, 1977]) that may result in more powerful systems.
For a review of the current state of the art, see [Chen, 1977], [Pavlidis, 1977], [Kanal,
1977] and [Proceedings, 1976].

The remainder of this section contains brief descriptions of major approaches to
pattern recognition. Specific techniques are grouped according to their bias toward one of
the two primary models: the classification model and the description model. Artificial
intelligence research, discussed in the next section, has been a major factor involved in the
movement away from complete adherence to the classification model and towards exploration
of the description model.

3.1.1  Classification  Model 

In this model, patterns (feature vectors) are viewed as members of a class and the aim
is to assign observed patterns to classes. The classification may be either statistical,
wherein the patterns are thought to belong to one of a number of classes according to a set
of probability density functions, or fuzzy, wherein patterns are thought to have differing
degrees of membership in a number of classes [Zadeh, 1973].

Variations

Classifiers may be categorized in a number of ways, depending on the type of
classification rule and the sampling procedure they employ [Hunt, 1970].

Parallel classifiers base their classifications upon the complete set of features,
extracted simultaneously during a single observation of a pattern. Sequential classifiers
assign a pattern to a class on the basis of a sequence of observations. After each
observation is made, and integrated with past observations, a decision is made as to whether
sufficient information has been gathered upon which to base a classification, or whether
another observation must be made, according to a test like the Wald Sequential Likelihood
Ratio Test [Wald, 1947].
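
The stopping decision made by a sequential classifier can be sketched in Python as below. The
sketch follows the usual form of Wald's test (often called the sequential probability ratio
test) for two classes, under the assumption that each class produces scalar observations from a
Gaussian density with known mean and a shared standard deviation. The function name sprt, the
error rates alpha and beta, and the example values are assumptions for illustration.

```python
import math

def sprt(observations, mean0, mean1, sigma, alpha=0.05, beta=0.05):
    """Return (decision, n_used); decision is 0, 1, or None if still undecided."""
    upper = math.log((1 - beta) / alpha)  # accept class 1 at or above this
    lower = math.log(beta / (1 - alpha))  # accept class 0 at or below this
    llr = 0.0  # accumulated log likelihood ratio
    n = 0
    for n, x in enumerate(observations, start=1):
        # log p(x | class 1) - log p(x | class 0) for equal-variance Gaussians
        llr += ((x - mean0) ** 2 - (x - mean1) ** 2) / (2 * sigma ** 2)
        if llr >= upper:
            return 1, n
        if llr <= lower:
            return 0, n
    return None, n  # observations ran out before either threshold was crossed

print(sprt([1.0] * 10, mean0=0.0, mean1=1.0, sigma=1.0))  # (1, 6)
print(sprt([0.5] * 4, mean0=0.0, mean1=1.0, sigma=1.0))   # (None, 4)
```

Observations that favor one class strongly end the test early; ambiguous observations, such as
values midway between the two means, leave the ratio unchanged and force further sampling.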

Adaptive classifiers (see example in appendix) are distinguished by the fact that their
classification rules are themselves adjusted to improve performance as experience is gained
with patterns drawn from the various classes of interest (a variety of procedures have been
developed to adjust the rules; see, for example, [Widrow, 1960]). Non-adaptive
classifiers, on the other hand, use a fixed set of classification rules, and in the language of
this paper are not considered to be learning systems.
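
A minimal sketch of an adaptive classifier follows. It uses a perceptron-style
error-correction rule on a linear decision function, chosen here for brevity; it is not
necessarily the procedure of the work cited above, and the function names and the small
training set are assumptions for illustration.

```python
def train_linear_classifier(samples, n_features, rate=1.0, max_epochs=100):
    """Error-correction training; labels are +1 or -1. Returns (weights, bias)."""
    w = [0.0] * n_features
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * score <= 0:  # misclassified: adjust the rule
                w = [wi + rate * label * xi for wi, xi in zip(w, x)]
                b += rate * label
                mistakes += 1
        if mistakes == 0:  # the rule is stable on the training patterns
            break
    return w, b

def classify(w, b, x):
    """Apply the current linear rule to a feature vector."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical, linearly separable training patterns with known class membership.
samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w, b = train_linear_classifier(samples, n_features=2)
print([classify(w, b, x) for x, _ in samples])  # [-1, -1, -1, 1]
```

A non-adaptive classifier would correspond to calling classify with weights that are fixed in
advance and never passed through the training loop.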

Bayesian Classification

This type of classification is optimal in the probability of error sense. The strategy is
minimization of the average risk of a classification, and complete knowledge of the a priori
and conditional probability densities is assumed (where the a priori probability is the
probability that a pattern is drawn from a particular class, regardless of its observed
characteristics, and the conditional probability is the probability of observing the
pattern's characteristics given that it is drawn from a particular class). The notion of risk
arises because costs are assumed to be associated with different types of classification
errors. When equal costs are assumed for all types of error, the result is the maximum a
posteriori (MAP) classifier (where the a posteriori probability is the probability that a pattern
has been drawn from a particular class, based on its observed characteristics).
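
The relation between minimum-risk and MAP classification can be made concrete with a short
Python sketch. The a priori probabilities, the conditional probabilities of the observation,
and the cost tables below are invented for illustration, as are the function names.

```python
def posteriors(priors, likelihoods):
    """A posteriori P(class | x) from priors P(class) and likelihoods p(x | class)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]

def min_risk_decision(priors, likelihoods, cost):
    """cost[i][j] is the cost of deciding class i when the true class is j."""
    post = posteriors(priors, likelihoods)
    risks = [sum(c * p for c, p in zip(row, post)) for row in cost]
    return min(range(len(risks)), key=risks.__getitem__)

priors = [0.8, 0.2]        # hypothetical a priori class probabilities
likelihoods = [0.3, 0.6]   # hypothetical p(x | class) for one observed pattern

equal_costs = [[0, 1], [1, 0]]     # every error costs the same: MAP decision
unequal_costs = [[0, 5], [1, 0]]   # missing class 1 is five times as costly

print(min_risk_decision(priors, likelihoods, equal_costs))    # 0
print(min_risk_decision(priors, likelihoods, unequal_costs))  # 1
```

With equal costs the decision is simply the class with the larger a posteriori probability
(2/3 against 1/3 here); the unequal cost table reverses the decision for the same observation.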

Maximum Likelihood Classification

Likelihood is the conditional probability of the observed characteristics of a pattern,
given that the pattern is drawn from a particular class; the pattern is assigned to the class
for which this likelihood is greatest. No knowledge of a priori
probabilities is assumed, but the method does assume knowledge of the form of the density
functions (e.g., Gaussian).
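
A minimal sketch of maximum likelihood classification with an assumed Gaussian form is given
below. Each class density is fitted to training values by its maximum likelihood estimates, and
a new value is assigned to the class whose density gives it the highest likelihood; no a priori
class probabilities enter the decision. The one-dimensional feature, the class names, and the
training values are assumptions for illustration.

```python
import math

def fit_gaussian(values):
    """Maximum likelihood estimates (mean, standard deviation) for one class."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(var)

def gaussian_log_likelihood(x, mean, sigma):
    """log p(x | class) for a Gaussian density with the given parameters."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mean) ** 2 / (2 * sigma ** 2)

def ml_classify(x, class_params):
    """class_params maps a class name to its (mean, sigma) pair."""
    return max(class_params, key=lambda c: gaussian_log_likelihood(x, *class_params[c]))

# Hypothetical training values of known class membership.
training = {"A": [0.9, 1.0, 1.1], "B": [2.8, 3.0, 3.2]}
params = {name: fit_gaussian(values) for name, values in training.items()}

print(ml_classify(1.2, params))  # A
print(ml_classify(2.9, params))  # B
```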


Nonparametric Classification

This  type  of  classification  does  not  guarantee  the  best  possible  performance  but 
requires  no  knowledge  of  the  underlying  probability  density  functions  that  govern  the 
generation  of  patterns.  Techniques  used  in  non-parametric  classification  include  the  K 
Nearest  Neighbor  Rule,  which  bypasses  probabilities  altogether,  and  assigns  patterns  to 
classes  based  on  the  proximity  of  their  observed  characteristics  to  those  of  neighboring 
patterns  of  known  class  membership,  and  the  Fisher  Linear  Discriminant,  which  is  used  to 
transform  the  feature  space  into  another  (decision)  space  (typically  of  lower  dimensionality), 
in  which  parametric  procedures  can  be  employed  [Duda,  1973]. 
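
The K Nearest Neighbor Rule described above can be sketched in a few lines of Python. The
sketch assumes Euclidean distance in the feature space and a simple majority vote among the k
closest labeled patterns; the function name knn_classify and the sample points are invented for
illustration.

```python
import math
from collections import Counter

def knn_classify(query, labeled_points, k=3):
    """Assign query to the majority class among its k nearest labeled patterns."""
    nearest = sorted(labeled_points, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical patterns of known class membership in a two-dimensional feature space.
known = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]

print(knn_classify((0.5, 0.5), known))  # a
print(knn_classify((5.5, 5.5), known))  # b
```

No density function is estimated at any point; the stored patterns themselves serve as the
classification rule.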


3.1.2  Description  Model 

With  this  model,  emphasis  is  placed  on  segmentation  of  the  patterns  into  a set  of 
meaningful  primitives,  and  on  generation  of  structural  descriptions  (generally  symbolic  in 
form)  of  the  patterns.  It  is  further  assumed  that  a great  deal  of  a priori  knowledge  of  the 
pattern  types  that  are  of  interest  is  available. 

The  approach  is  useful  in  applications  like  scene  analysis  [Duda,  1973],  [McCarthy, 
1974]  where  classification  is  clearly  inappropriate.  It  also  tends  to  be  useful  when  the 
patterns  themselves  are  complex  [Fu,  1977],  as  it  emphasizes  hierarchical  decomposition  of 
patterns  into  their  constituent  components. 

There are a variety of descriptive formalisms in which to express the structural
descriptions. These include pattern grammars [Fu, 1974], and relational graphs [Winston,
1976]. Pattern grammars embody an attempt to carry over a large amount of theory from
the study of natural and programming languages. A variety of pattern grammars have been
developed [Kanal, 1974], both deterministic and stochastic in form. Relational graphs have
been used in pattern recognition systems developed by the artificial intelligence community
(see, for example, [Winston, 1970]). Pattern primitives are taken as nodes in a
directed graph whose edges indicate the relations between the primitives. Such graphs form
a convenient representation for patterns with a high degree of hierarchical structure.
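
As a rough illustration of a relational graph, the Python sketch below stores pattern
primitives as nodes and labeled, directed relations as edges. The class RelationalGraph, its
method names, and the small "arch" description are assumptions made for this example and do
not reproduce the representation of any cited system.

```python
from collections import defaultdict

class RelationalGraph:
    """Directed graph whose nodes are primitives and whose edges are named relations."""

    def __init__(self):
        self.edges = defaultdict(set)  # node -> {(relation, other_node)}

    def relate(self, source, relation, target):
        """Record that `source` stands in `relation` to `target`."""
        self.edges[source].add((relation, target))

    def holds(self, source, relation, target):
        """Check whether the directed relation has been recorded."""
        return (relation, target) in self.edges.get(source, set())

# Hypothetical structural description of an arch built from three blocks.
arch = RelationalGraph()
arch.relate("post-1", "supports", "lintel")
arch.relate("post-2", "supports", "lintel")
arch.relate("post-1", "left-of", "post-2")

print(arch.holds("post-1", "supports", "lintel"))  # True
print(arch.holds("lintel", "supports", "post-1"))  # False
```

Because edges are directed, the description distinguishes "post supports lintel" from its
converse, which a feature vector of unordered measurements would not do directly.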

The text by Duda and Hart [Duda, 1973] is an excellent introduction to the methods
used in the structural approach.


4 Artificial Intelligence Approach to Learning

In the 1950's and early 1960's there was considerable discussion of learning programs
in the Artificial Intelligence (AI) literature (e.g., [Oettinger, 1952], [Friedberg, 1968],
[Selfridge, 1969], [Newell, 1982], [Feigenbaum, 1963], [Minsky, 1963] and [Simon,
1966]). It was hoped at the time that a general learning program could be written to
accumulate and refine a large, detailed knowledge base about a domain [Minsky, 1972]. The
knowledge base, then, could be used by ever-improving high performance programs that
reason in that domain. Samuel's programs that learn to play excellent checkers [Samuel,
1963] were an early demonstration of success, but also demonstrated the amount of effort
necessary to achieve success. On the reasons why learning tasks have been central in AI,
Newell wrote [Newell, 1973]:

    Inductive tasks have always been a prominent part of the AI
    landscape. The reasons for this seem to be twofold. For one, we
    have inherited a classic distinction between deduction and induction,
    so that the search for intelligent action should clearly look to
    induction. Second, American psychology has largely identified the
    central problem of conceptual behavior with the acquisition or
    formation of concepts -- which in practice has turned out to mean the
    induction of concepts from a set of presented exemplars.

    This tendency, shaped strongly by Bruner, Goodnow, and Austin's
    Study of Thinking [Bruner, 1956], derives fundamentally from the
    emphasis on learning that has characterized American psychology
    since the rise of behaviorism.


The motivations for writing these programs are diverse. Some are written as testable
psychological models of how human subjects perform a learning task (e.g., [Simon, 1963], [Hunt,
1963], [Feigenbaum, 1963] and [Hunt, 1966]), others are written to demonstrate the
feasibility of a method (e.g., [Soloway, 1978]), and still others are written with the express
purpose of helping human problem solvers codify and explain data (e.g., [Buchanan, 1978]).
Insofar as all the programs mentioned below perform well at their stated tasks, they all
illustrate the emerging power of heuristic programming methods for improving the problem
solving power of computer programs.

All the AI learning programs written to date have strong limitations on their generality.
Some are applicable to just one kind of problem, others work with several types of problems
within a larger class defined by the representation of objects and relations in the domain.

Early AI research was closely tied to pattern recognition and the adaptive systems
approach (see, for example, [Selfridge, 1963], [Uhr, 1963] and [Uhr, 1973]). Much work
has been performed on learning automata [Nilsson, 1965] (see also [Narendra, 1974]), and
neural networks that grow in response to stimuli [Minsky, 1972]. All of these efforts have
aimed at defining simple machines that learn to respond to their environments [Findler,
1969]. Newell [Newell, 1973] traces one line of growth from stimulus-response learning in
psychology to (i) pattern recognition and self-organizing systems, as well as to (ii) concept
formation, induction and other AI work. The two fields diverged in the 1960's, and are now
quite distinct. Whereas pattern recognition and control research emphasizes adjustment of
parameters, AI research emphasizes construction of symbolic structures, based on
conceptual relations. For example, Feigenbaum's EPAM program [Feigenbaum, 1963] used a
discrimination net (i.e., a tree of tests and branches) to store the relations required to
recall nonsense syllables in a rote learning experiment (see [Fikes, 1972], [Sussman,
1973], and [Winston, 1976] for further examples).
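
The idea of a discrimination net as a tree of tests and branches can be sketched as below. This
is a deliberately simplified illustration and not a reconstruction of EPAM: it assumes stimuli
are distinct syllables of equal length, each test inspects a single letter position, and the
net grows a new test only when a new stimulus collides with a stored one. All names and the
syllable pairs are hypothetical.

```python
class Leaf:
    """Terminal node holding one stimulus and its associated response."""
    def __init__(self, stimulus, response):
        self.stimulus, self.response = stimulus, response

class Node:
    """Internal node that tests one letter position and branches on its value."""
    def __init__(self, position):
        self.position = position
        self.branches = {}  # letter -> subtree

def insert(node, stimulus, response):
    """Return the net after storing the stimulus-response pair."""
    if node is None:
        return Leaf(stimulus, response)
    if isinstance(node, Leaf):
        if node.stimulus == stimulus:
            node.response = response
            return node
        # Grow a new test on the first letter that tells the two stimuli apart.
        pos = next(i for i, (a, b) in enumerate(zip(node.stimulus, stimulus)) if a != b)
        test = Node(pos)
        test.branches[node.stimulus[pos]] = node
        test.branches[stimulus[pos]] = Leaf(stimulus, response)
        return test
    key = stimulus[node.position]
    node.branches[key] = insert(node.branches.get(key), stimulus, response)
    return node

def recall(node, stimulus):
    """Sort the stimulus down the net and return the stored response, if any."""
    while isinstance(node, Node):
        node = node.branches.get(stimulus[node.position])
    return node.response if node is not None else None

net = None
for stim, resp in [("DAX", "JIR"), ("PIB", "JUK"), ("DAK", "ZOF")]:
    net = insert(net, stim, resp)

print(recall(net, "DAX"), recall(net, "DAK"), recall(net, "MUP"))  # JIR ZOF None
```

What is learned here is the structure of the tree itself rather than a set of numeric
parameters. In this sketch, a stimulus that was never stored can still be sorted to an existing
leaf if the tests it meets do not distinguish it (for example, "PAX" reaches the leaf for
"PIB").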

In AI, it is commonly believed that a learning system should have sufficient internal
structure to develop a strong theory of its environment [Feigenbaum, 1971], [McCarthy,
1968] and [Minsky, 1972a]. Much emphasis has therefore been placed on building
knowledge-based or expert systems that not only have the capacity for high performance,
but can also explain their performance in symbolic terms [Davis, 1976].

Various levels of sophistication in learning systems are described by [Winston, 1976]:
learning by being programmed, learning by being told, learning from a series of examples, and
finally learning by discovery. We see in this categorization a gradual shift in responsibility from
the designer/teacher to the learning system/student. At the highest level, the system is
able to find its own examples, and carry on autonomously; at the lowest level the system is
learning only in the sense that a programmer is explicitly programming it to do something.

The formalism of inductive inference has also captured much attention (e.g.,
[Solomonoff, 1977], [Holland, 1962], [Hajek, 1976], [Meltzer, 1970], [Meltzer, 1973]
and [Plotkin, 1971]). The purpose of much of the work on abstract formalisms is to find
general principles of induction that can be mechanized. This was also a goal of Bacon and
Leibniz centuries ago.

Considerable work is still expended on the Leibnizian dream of an abstract formalism for
scientific inference. Some of this work is done specifically with computer programs in mind.
Much of it, however, is done in abstraction. Programs based on these formalisms form
hypotheses from data without any special knowledge of the domain from which the data were
collected. The drawback of very general methods is that while they may produce some
interesting empirical generalizations, they are likely to produce many generalizations that
experts in the domain would regard as trivial or meaningless. In short, they lack a working
model of the domain to guide judgments of plausibility.

Some  recent  programs  explicitly  recognize  the  need  for  problem-specific  constraints.  The 
Meta-DENDRAL  program  [Buchanan,  1978]  discovers  general  rules  about  the  behavior  of 
chemical compounds in an analytic instrument known as a mass spectrometer. The data are
noisy,  they  do  not  come  already  classified,  the  space  of  possible  explanations  is  very  large, 
and  there  is  no  single  correct  answer.  Nevertheless,  the  program  finds  regularities  in  these 
data  and  formulates  general  rules  to  explain  them. 

The  AQVAL  program  [Larson,  1976]  accepts  a set  of  descriptions  of  objects,  and 
produces  rules  that  can  correctly  classify  these  objects  and  others  like  them.  For  example, 
for descriptions of Eastbound and Westbound railroad cars containing circles, triangles,
rectangles, etc., the program is able to find the shapes and relations among shapes that
discriminate the two trains.

Still another program, named Thoth-pb [Vera, 1978], is able to learn rules for (i)
extending letter sequences, (ii) recognizing geometric analogies, (iii) relating before and after
situations, and (iv) relating sequences of situations. It uses background knowledge about
the domain to help it recognize important relations among features of objects that are not
codified in the descriptions of the objects themselves.


4.1  Game  Playing 

Much of the work with learning systems in AI research has been done in the context of
games. Improvement of the game playing program is the ostensible goal, but the learning task
itself is often the reason for the work (see, for example, [Newman, 1986]). The nature of
the learned information ranges from parameters governing the evaluation of moves (and
ultimately their selection) to symbolic rules expressing how to play well in different
situations.

Samuel's work is the best known in this field [Samuel, 1963], [Samuel, 1967]. In the
context of a checker-playing program, he has explored rote learning, parameter tuning, and
building signature tables, which are clusters of dependent features with weights that can
be used to evaluate moves (cf. [White, 1970]). (Griffith [Griffith, 1974] later compared
the methods used by Samuel with a simple heuristic procedure.) Waterman [Waterman,
1970] compared the performance of a poker-playing program after learning from a human
teacher with its performance after automated learning. The program represented its
heuristics of good play in a table of conditional rules, or productions, that the learning
system altered in light of mistakes. Waterman has generalized many of these ideas to other
tasks [Waterman, 1976]. Findler [Findler, 1977] has also studied the game of poker.
Pitrat's work on learning patterns in chess [Pitrat, 1974] applies many heuristic search
ideas to learning useful combinations from examples of given games. Programs have also
been written to learn dominoes [Smith, 1973], Go-Moku [Elcock, 1967], and the rules of
Tic-Tac-Toe [Popplestone, 1969]. Banerji [Banerji, 1974] has studied learning processes
for several classes of games and puzzles from a more formal point of view. Koffman
[Koffman, 1968] has also related game playing to pattern recognition.


4.2  Concept  Formation 

In concept formation tasks, a computer program (or human subject) is presented with
objects, or descriptions of objects, that exhibit a common concept. The program (or subject)
is expected to generalize from these instances well enough to classify new objects
accurately. Negative instances (objects which fail to exhibit the concept) are sometimes
presented to the program (and identified as negative instances) in addition to the exemplars
of the concept. When training includes negative instances, learning is faster and more
accurate. Concept formation has long interested psychologists as a learning task. As with
other learning tasks, computer programs have been written to simulate the performance of
human subjects, and thus test a psychological model [Simon, 1963]. Or they have been
written to learn by mechanisms other than those humans use, and thus demonstrate some
modicum of intelligence on the part of computers.

Two frequently cited AI concept formation programs are those written by Evans
[Evans, 1968] and Winston [Winston, 1976]. Evans' program finds analogies among
geometric figures to solve standard intelligence test problems of the form A is to B as C is to
[pick one of D1, D2, D3, D4, D5]. The concept here is a transformation or rule which maps
figure A into B and also maps figure C into one of the answer choices.

For Winston's program the task is to produce a correct description of a concept
exhibited in a set of line drawings of block figures. An important feature is the introduction
of near misses, i.e., figures that fail to exhibit the concept because they differ with respect
to a small number of essential properties. The program learns the correct description of an
arch, for example, from descriptions of two posts and a lintel (exemplar) and of near misses
such as tees and posts with a fallen lintel.
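
The role of near misses can be illustrated with a small sketch. It is not Winston's
program: descriptions are assumed here to be flat sets of relation tuples, the relation
names (supports, left_of, touches) are hypothetical, and every near miss is compared
against a single exemplar. Relations that a near miss lacks are taken to be required, and
relations that a near miss adds are taken to be forbidden.

```python
def learn_from_near_misses(exemplar, near_misses):
    """Return (required, forbidden) relation sets inferred from near misses."""
    required, forbidden = set(), set()
    for miss in near_misses:
        required |= exemplar - miss   # in the exemplar but absent from the near miss
        forbidden |= miss - exemplar  # absent from the exemplar but in the near miss
    return required, forbidden


def exhibits_concept(description, required, forbidden):
    """A description fits if it has every required relation and no forbidden one."""
    return required <= description and not (forbidden & description)


# Hypothetical descriptions of an arch and of two near misses.
arch = {
    ("supports", "post1", "lintel"),
    ("supports", "post2", "lintel"),
    ("left_of", "post1", "post2"),
}
fallen_lintel = {("left_of", "post1", "post2")}
touching_posts = arch | {("touches", "post1", "post2")}

required, forbidden = learn_from_near_misses(arch, [fallen_lintel, touching_posts])
```

Under these assumptions the two support relations become required and the touching
relation becomes forbidden, while the left_of relation is left as an incidental property
of the exemplar.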

Another  recent  program  learns  concepts,  such  as  Hit  and  Out,  for  the  game  of  baseball 
from  a set  of  descriptions  of  events  over  the  span  of  a game  [Soloway,  1978].  Other 
concept formation programs are described in [Simon, 1963], [Johnson, 1964], [Hunt,
1975], [Zagorulko, 1976], [Langley, 1977], [Larson, 1978], [Buchanan, 1978], [Mitchell,
1977], [Hayes-Roth, 1976], [Hayes-Roth, 1977], [Hedrick, 1976] and [Rychener,
1978].


4.3  Grammatical  Inference  and  Sequence  Extrapolation 

Grammatical inference and sequence extrapolation have often been taken as prototype
induction problems. The task is to find a rule (or set of rules) that can serve as the
generating principle for a training set of symbol strings. For example, the training instances
may be the following allowable sentences in a hypothetical language: A, AB, ABB, ABBB. An
uninteresting set of rules is just the training instances themselves. Without some
generalization  from  the  training  instances,  prediction  of  new  sentences  is  impossible.  The 
following  two  rules,  then,  will  serve  to  define  the  grammar  of  which  these  strings  are  correct 
sentences: 


    (R1)  A          ('A' alone is a sentence)
    (R2)  A -> AB    ('A' can be replaced by 'AB')
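
Rules R1 and R2 can be exercised directly. The sketch below is illustrative only: the
function names are assumptions, and R2 is read as a rewriting step applied to the single
'A' in a sentence. One function generates sentences from the rules, the other tests a
string by undoing R2 until only R1 remains.

```python
def generate(max_length):
    """List the sentences of length <= max_length derivable from R1 and R2."""
    sentence = "A"                                  # R1: 'A' alone is a sentence
    sentences = [sentence]
    while len(sentence) < max_length:
        sentence = sentence.replace("A", "AB", 1)   # R2: replace 'A' by 'AB'
        sentences.append(sentence)
    return sentences


def is_sentence(string):
    """Check membership by reversing R2 until the string can shrink no further."""
    while string.startswith("AB"):
        string = "A" + string[2:]                   # undo one application of R2
    return string == "A"


training_set = ["A", "AB", "ABB", "ABBB"]
```

The two rules cover the training set and also predict sentences that were never
presented, such as ABBBB, which is the generalization the training instances alone
cannot supply.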

The sequence extrapolation task is similar: given a sequence of symbols (usually, but
not always, numerals) such as 1, 3, 5, 7, 9, find a rule that allows correct prediction of the next
member of the ordered sequence. In this case, the generating principle is

    (R3)  nth member = 2n - 1


Both of these problems exhibit many characteristics of scientific hypothesis formation.
Regularities in the data must be found and characterized, different generating principles
must be proposed and tested, and alternative hypotheses must be ranked, for example by
simplicity. Most programs [Bierman, 1972], [Persson, 1966] assume the initial data are
free of errors. Many of these programs explicitly search a space of hypotheses (e.g.,
Cook's grammatical inference program [Cook, 1976]), but most recent work on grammatical
inference emphasizes more formal methods [Bierman, 1972], [Gold, 1967] and [Fu, 1976].
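
The propose, test, and rank cycle described above can be sketched for sequence
extrapolation. This is a minimal illustration, not any of the cited programs: the
candidate list and its ordering (simplest first) are assumptions, and the data are
assumed to be free of errors.

```python
# Hypothetical hypothesis space, ordered from simplest to least simple.
CANDIDATES = [
    ("n", lambda n: n),
    ("2n", lambda n: 2 * n),
    ("2n - 1", lambda n: 2 * n - 1),
    ("n^2", lambda n: n * n),
]


def induce_rule(sequence):
    """Return (name, rule) for the simplest candidate that fits every member."""
    for name, rule in CANDIDATES:
        if all(rule(n) == value for n, value in enumerate(sequence, start=1)):
            return name, rule
    return None  # no candidate in the space explains the data


name, rule = induce_rule([1, 3, 5, 7, 9])
next_member = rule(6)  # extrapolate to the sixth member
```

Because the candidates are tried in order of simplicity, the first consistent rule is
also the preferred one; a sequence that no candidate explains yields no hypothesis at all.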

Inferring natural language [Siklossy, 1972] and inferring simple computer programs from
examples are other induction tasks that have been studied using AI techniques [Waterman,
1976], [Hardy, 1976], [Shaw, 1977], [Waterman, 1976]. The training instances are often
input-output pairs and the task of the induction system is to find the rule (procedure) that
will produce the specified output symbols for each associated input. While the tasks are
similar to concept formation and grammatical inference, the languages are so much richer that
progress is slow.


6 A Model  of  Learning  Systems 

This section is concerned with a simple functional model that is useful for
characterizing, comparing, and designing learning systems. Many of the functional
components of an LS are essential to intelligent problem solving systems in general, as noted
by Simon and Lea [Simon, 1973]; that is, learning (induction, concept formation, etc.) is
problem solving of one kind, which means that AI problem solving methods and
representations can be expected to apply to this task as well as to others.


6.1  Effects  of  the  Environment 

The  environment  from  which  training  instances  are  drawn,  and  in  which  an  LS  operates, 
may  have  a profound  effect  upon  the  LS  design.  LS  environments  can  be  divided  into  two 
major categories: those that provide the correct response for each training instance
(supervised learning) and those that do not (unsupervised learning). Supervised learning
systems operate within a stimulus-response environment in which the desired LS output is
supplied with each training instance. Examples include Samuel's book move checkers
program [Samuel, 1963], [Samuel, 1967], and grammatical inference programs [Hunt,
1976].

Unsupervised LSs operate within an environment of instances for which the correct
response is not directly available. The version of Samuel's program that learns by playing
checkers against an opponent falls into this category [Samuel, 1963] since moves are not
classified by opponents as, say, excellent, good, poor or terrible. Learning systems
operating within this type of environment must themselves infer the correct response to
each training instance by observation of system performance for a series of instances. As a
result, assignment of credit or blame for overall performance to individual responses is
generally a problem for these systems [Minsky, 1963]. Tsypkin [Tsypkin, 1968] has
pointed out that unsupervised learning is somewhat of an illusion in the sense that a
teacher/designer defines the standards that determine the quality of operation of the LS at
the outset, whether or not he is present during the actual operation of the system.


Environments can be further categorized as noise-free or noisy. Noise-free
environments, such as that of Winston's structural description learning program
[Winston, 1976], provide instances paired with correct responses which the system
assumes to be perfectly reliable. Most AI systems assume noise-free environments. (One
exception is described in [Buchanan, 1978].) Noisy environments, on the other hand, do
not provide such perfect information, as is usually the case when empirical data are involved.
Pattern recognition and control systems frequently operate within noisy environments
[Barrow, 1972], [Duda, 1973], [Donalson, 1966].


6.2  The  Model  - Overview 

The proposed LS model is shown in Figure 2. The PERFORMANCE ELEMENT is
responsible for generating an output in response to each new stimulus. The
INSTANCE SELECTOR selects suitable training instances from the environment to present to
the performance element. The CRITIC analyzes the output of the performance element in
terms of some standard of performance. The LEARNING ELEMENT makes specific changes to
the system in response to the analysis of the critic. Communication among the functional
components is shown via a BLACKBOARD to ensure that each functional component has
access to all required system information, such as the emerging knowledge base. Finally, the
LS  operates  within  the  constraints  of  a WORLD  MODEL  which  contains  the  general 
assumptions  and  methods  that  define  the  domain  of  activity  of  the  system. 
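
The division of labor among these components can be sketched in a few lines. The code
below is only an illustration of the functional decomposition, not a rendering of
Figure 2: the toy task (learning an integer threshold from labeled numbers), the fixed
step size, and all names are assumptions introduced for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    knowledge: dict = field(default_factory=dict)  # learned information
    trace: dict = field(default_factory=dict)      # temporary working information


WORLD_MODEL = {"step": 1}  # fixed assumptions; no component alters them


def instance_selector(environment):
    """Passive selection: present the instances as the environment supplies them."""
    yield from environment


def performance_element(board, stimulus):
    """Classify the stimulus with the current threshold and record the decision."""
    response = stimulus >= board.knowledge["threshold"]
    board.trace["last"] = (stimulus, response)
    return response


def critic(board, correct_response):
    """Evaluate the last response and, if it was wrong, recommend a change."""
    _, response = board.trace["last"]
    if response == correct_response:
        return None
    return "lower threshold" if correct_response else "raise threshold"


def learning_element(board, recommendation):
    """Translate the abstract recommendation into a specific parameter change."""
    if recommendation == "lower threshold":
        board.knowledge["threshold"] -= WORLD_MODEL["step"]
    elif recommendation == "raise threshold":
        board.knowledge["threshold"] += WORLD_MODEL["step"]


def run(environment, passes=20):
    board = Blackboard(knowledge={"threshold": 0})
    for _ in range(passes):
        for stimulus, correct_response in instance_selector(environment):
            performance_element(board, stimulus)
            learning_element(board, critic(board, correct_response))
    return board.knowledge["threshold"]


# Supervised, noise-free environment for the hidden concept "stimulus >= 5".
environment = [(1, False), (4, False), (5, True), (9, True)]
learned_threshold = run(environment)
```

Every component communicates only through the blackboard, so any one of them could be
replaced (for example, by an active instance selector) without changing the others.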

The components of the model are conceptual entities that specify functions that must
be performed to effect learning. Although the functional decomposition suggested by the
model is not necessarily reflected in the physical decomposition of many existing systems,
the model is useful for comparing systems and may aid in future learning system designs.

The following sections present detailed discussions of the LS model components shown
in Figure 2. In addition, the appendix contains detailed characterizations of representative
AI, pattern recognition, and control systems in terms of the model. The reader may find it
helpful to refer occasionally to the appendix while reading the following sections.


6.3  Performance  Element 


The performance element uses the learned information to perform the stated task. It
has been included in the LS model because of the intimate relationship between what
information is to be learned and how this learned information is to be used.


Performance  elements  are  usually  tailored  more  to  the  requirements  of  the  task  domain 
than to the architecture of the LS. In general, the performance element can be run in a
stand-alone mode without learning, independent of the rest of the LS. In any LS, however,
the ability to improve performance presupposes a method of communicating learned
information to the performance element. Since its architecture must allow learned information
to affect its decisions, additional constraints are placed on the performance element within
an LS. The performance element should be constructed so that information about its internal
machinations is readily available to the other system components. This information can be
used to make possible detailed criticism of performance, and intelligent selection of further
instances to be examined by the system.


The  performance  elements  of  existing  systems  also  vary  in  the  ways  they  may  be 
altered by learning. For example, systems whose operation is determined by a set of
production rules [Waterman, 1970], [Waterman, 1976] have the potential to exhibit richer
variations than systems whose operations are keyed only to the adjustment of parameter
values [Landau, 1974], [Michie, 1974].


6.4  Instance  Selector 


The instance selector selects training instances from the environment that are to be
used by the LS. It is a functional component not clearly isolated in earlier adaptive system
models.


In existing LSs, methods for instance selection vary mainly along the dimensions of
responsibility and sophistication. The responsibility for instance selection varies between
the extremes of completely external (passive) selection, and completely internal (active)
selection. In psychological experiments on concept formation, instance selection is closely
controlled by the experimenter and the subject is completely passive in this respect.
Instance selection in Samuel's book move checkers program [Samuel, 1963] is externally
controlled, whereas Popplestone's program [Popplestone, 1969], which learns the features
that characterize a winning position in tic-tac-toe, generates its own training instances. It
forms alternate hypotheses, and then generates instances to choose among them (relying
upon an external critic to evaluate these instances). (See also [Simon, 1973].) In the
adaptive systems literature, Tse and Bar-Shalom [Tse, 1976] use a form of active instance
selection known as dual control. They adjust the input to a system in such a way as to
simultaneously control its output and obtain information about its internal structure.

The degree of sophistication used for LS instance selection is also an important
consideration. In order to qualify as sophisticated, an instance selector must be sensitive to
the current abilities and deficiencies of the performance element and must construct or
select instances which are designed to improve performance. Winston [Winston, 1976] has
shown the advantages to be accrued through presenting carefully constructed examples and
near-misses of the concepts to be acquired by an LS. In general, careful instance selection
can improve the reliability and efficiency of an LS. It is important to note, however, that this
may not always be permitted by the environment in which the LS operates, as is generally
the case for adaptive control systems [Donalson, 1966].
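
Active instance selection of the kind described in this section, in which the system
forms alternate hypotheses and then generates instances to choose among them, can be
sketched as follows. The sketch does not reproduce any cited program: the hypotheses are
assumed to be integer thresholds, the external critic is assumed to be noise-free, and
the selection heuristic (prefer the instance on which the surviving hypotheses disagree
most evenly) is an assumption made for illustration.

```python
def predicts_positive(threshold, x):
    """A hypothesis is a threshold; it calls x positive when x >= threshold."""
    return x >= threshold


def select_instance(hypotheses, candidates):
    """Choose the instance that splits the surviving hypotheses most evenly."""
    def imbalance(x):
        votes = sum(predicts_positive(h, x) for h in hypotheses)
        return abs(2 * votes - len(hypotheses))
    return min(candidates, key=imbalance)


def learn_actively(hypotheses, candidates, critic):
    """Query the external critic until one hypothesis (or none) survives."""
    hypotheses, queries = list(hypotheses), 0
    while len(hypotheses) > 1:
        x = select_instance(hypotheses, candidates)
        label = critic(x)  # external evaluation of the generated instance
        survivors = [h for h in hypotheses if predicts_positive(h, x) == label]
        if len(survivors) == len(hypotheses):
            break  # no available instance distinguishes the remaining hypotheses
        hypotheses, queries = survivors, queries + 1
    return hypotheses, queries


# Eight competing hypotheses; the critic embodies the hidden concept "x >= 6".
result = learn_actively(range(1, 9), range(10), critic=lambda x: x >= 6)
```

Because each generated instance is chosen to discriminate among the current hypotheses,
the selector is sensitive to what the system does not yet know, which is the sense of
sophistication discussed above.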


6.5 Critic

The critic analyzes the current abilities of the performance element. It may play three
roles: EVALUATION, LOCALIZATION, and RECOMMENDATION. The critic always operates as
an evaluator in that it embodies a standard by which to assess the behavior of the
performance element. This is the role that has been emphasized in earlier adaptive system
models [Fu, 1970], [Glorioso, 1976], [Sklansky, 1964]. Feedback from a critic, at least as
evaluator, is essential for learning.

The critic may also localize errors, that is, identify the reasons for poor performance.
This type of behavior is essential for resolution of the credit
assignment problem described by Minsky [Minsky, 1963]. In its diagnostic role, the critic is
exemplified by the bug classifier and summarizer in Sussman's HACKER [Sussman, 1973].

Finally  the  critic  may  recommend  repairs  by  making  specific  recommendations  for 
improvement  or  suggestions  about  future  instances.  In  Waterman's  poker  player  [Waterman, 
1970],  the  critic  in  this  role  suggests  the  bet  that  should  have  been  made  by  the 
performance  element  for  a particular  training  instance.  The  critic  not  only  recognizes  poor 
play  and  isolates  the  production  rules  responsible  for  it,  but  suggests  specific  corrections  so 
the  program  will  not  play  as  poorly  in  similar  future  situations. 
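
The three roles of the critic can be made concrete with a small sketch. It is loosely
patterned on the poker example but is not Waterman's program: the rule table, the
numeric hand strength, and the report format are all assumptions made for illustration,
and the correct bet is assumed to be supplied from outside.

```python
# Hypothetical production rules: (name, condition on hand strength 0-10, bet).
RULES = [
    ("strong-hand", lambda strength: strength >= 8, "raise"),
    ("fair-hand", lambda strength: strength >= 4, "call"),
    ("default", lambda strength: True, "fold"),
]


def perform(rules, strength):
    """Performance element: fire the first rule whose condition holds."""
    for name, condition, bet in rules:
        if condition(strength):
            return name, bet
    raise ValueError("no rule applies")


def criticize(rules, strength, correct_bet):
    """Critic: evaluate the bet, localize the blame, recommend a correction."""
    fired, bet = perform(rules, strength)
    report = {"acceptable": bet == correct_bet}  # evaluation
    if not report["acceptable"]:
        report["culprit"] = fired                # localization
        report["advice"] = correct_bet           # recommendation
    return report


report = criticize(RULES, strength=5, correct_bet="raise")
```

The report names the responsible rule and the bet that should have been made, but says
nothing about how the rule table should be edited; that decision is left to the learning
element.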

The  dividing  line  between  critic  and  learning  element  is  not  sharp,  and  it  is  certainly 
possible to view recommendation as a function of either the learning element or the critic.
However, in mapping existing LSs into this model, we have adopted the convention that the
critic's recommendations to the learning element are at an abstract level removed from
implementation considerations such as data representation. This clearly separates the two
different functions of deciding what kind of change is needed and deciding how to implement
that change.

In some LSs the functions of the critic have been left to humans. For example,
MYCIN/TEIRESIAS [Davis, 1976] uses a human critic for evaluation, localization, and
recommendation. The performance program applies rules (to cases selected by humans) and
a human supplies criticism of results, localization of blame, and suggestions for altering the
rule base. Because the computer program assists the user in these tasks, the learning can
be said to be semi-automated.


6.6 Learning Element

The learning element is an interface between the critic and the performance element,
responsible for translating the abstract recommendations of the critic into specific changes
in the rules or parameters used by the performance element.

Representations for learned information exhibit great variety. They include, for example,
production rules [Waterman, 1970], parameterized polynomials [Samuel, 1963], executable
procedures [Sussman, 1973], signature tables [Samuel, 1967], stored facts [Feigenbaum,
1963], and graphs or networks [Winston, 1976]. The method of incorporating new learned
information is dependent upon the representation, and even among systems that use similar
representations, competing methods are found (contrast, for example, [Buchanan, 1976]
and [Waterman, 1970]).

The extent to which the learned information is altered in response to each training
instance is an important LS design consideration. In some systems, the learning element
incorporates exactly the information supplied by the critic [Winston, 1976]. Were the same
training instance to occur later, the response of the performance element would be exactly
as the critic advised for the first occurrence. This type of learning is well suited to
environments that provide perfect data and to systems with reliable critics. Under these
conditions the LS will converge rapidly to the desired behavior. If such a system were
provided with an incorrect classification by the environment or less than reliable advice by
the critic, however, it might commit itself to incorrect assumptions from which it could not
recover. Systems that make less drastic changes to the learned knowledge on the basis of
a single training instance are less vulnerable to imperfect information, but consequently
require more training instances to converge to the desired behavior. Many statistical LSs fall
into this category [Nilsson, 1966]. Other systems consider several training instances at a
time in order to minimize the effect of occasional noisy instances [Buchanan, 1978].
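
The trade-off between drastic and cautious changes can be illustrated numerically. The
sketch below is an assumption-laden toy, not a model of any cited system: the learned
information is a single number, each training instance is an observed value, and a
learning rate between 0 and 1 (a name introduced here) sets how far one instance moves
the estimate. A rate of 1.0 adopts each instance exactly; a batch median stands in for
systems that consider several instances at a time.

```python
from statistics import median


def incremental_estimates(observations, rate, start=0.0):
    """Move the estimate a fraction `rate` of the way toward each observation."""
    estimate, history = start, []
    for observed in observations:
        estimate += rate * (observed - estimate)
        history.append(estimate)
    return history


def batch_estimate(observations):
    """Consider several instances at once; the median ignores an occasional outlier."""
    return median(observations)


data = [10, 10, 10, 50]  # the final instance is noisy
drastic = incremental_estimates(data, rate=1.0)    # adopts each instance exactly
cautious = incremental_estimates(data, rate=0.25)  # small change per instance
```

The drastic learner reaches the desired value after one instance but is carried all the
way to the noisy value by the last one; the cautious learner is disturbed far less by the
noise but has not yet converged after three clean instances.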


6.7  Blackboard 

The  blackboard  of  this  model  is  a global  data  base  that  also  functions  as  a system 
communications mechanism. It is similar to the concept introduced in the HEARSAY system
[Lesser, 1976]. The blackboard holds two types of information: the information usually
associated with the knowledge base in AI programs, and the temporary information used by
the LS components. The knowledge base often contains the set of rules, parameter values,
symbolic structures, and so on, currently being used by the performance element. Such
information can be used as an aid to sophisticated instance selection if it is readily available.
The temporary, system-oriented information includes, for example, the intermediate decisions
made by the performance element in selecting a particular response. Detailed criticism by
the  critic  is  dependent  upon  the  availability  of  this  information. 

In many existing systems this information is not so clearly separated or defined. The
communication links between functional components, especially, are often programmed
directly. Because the same information is required by many of the individual functional
components of any LS, however, a blackboard is a more transparent communications
mechanism.


6.8 World Model

Whereas the blackboard contains information that can be altered by the LS
components, the world model contains the fixed conceptual framework within which the
system operates [Churchman, 1970]. The contents of the world model include definitions of
objects and relations in the task domain, the syntax and semantics of the information to be
learned, and the methods to be used by the LS. Among task domain definitions are, for
example, the rules of a game and the representation of inputs and outputs for the
performance element. This part of the world model simply defines the task of the
performance element, and the standard of performance (the evaluation function) to be
applied by the critic. Domain-specific heuristics are also commonly added to the world model
of AI systems to guide inferences made by the LS (e.g., heuristics about the world of blocks
in Winston's program [Winston, 1976]). Definitions of the syntax and semantics of
information to be learned define the mode of communication between the learning and
performance elements.

The  assumptions  and  constraints  from  which  the  world  model  is  composed  are  of  critical 
importance  in  the  design  and  characterization  of  LSs.  Although  many  of  these  assumptions 
are  often  hidden  in  the  various  functional  components,  the  LS  designer  and  user  must  both  be 
aware  of  each  of  them.  We  believe  that,  where  possible,  world  model  constraints  should  be 
made  explicit  in  order  to  allow  for  their  modification  during  the  design  process. 


6.9  Multi-Layer  Learning  Systems 

Although the world model cannot be altered by the LS that uses it, the designer can
alter its contents in order to improve LS performance. He often changes parameters and
procedures of the basic LS after observing and criticizing its behavior for some carefully
chosen training set. These alterations result in a new version of the LS, which is then tested
on some training set, and so on. The designer views the whole LS as a system whose
performance needs improvement, and he selects instances, criticizes performance, and
makes changes accordingly. In other words, the designer's activities can be modeled by a
system whose components are just those of Figure 2. This leads us to the concept of
layered LSs, each higher layer able to change the world model (vocabulary, assumptions,
etc.) of the next lower layer on the basis of criticizing its performance on a chosen set of
instances. Thus, adjustments can be made to the world model of some learning system LS1
by another learning system, LS2, that has its own functional components (critic, world model,
etc.), as shown in Figure 3. In turn, it is conceivable that a third system, LS3, could adjust
the world model of LS2, and so on. The designer constitutes the final critic, of course,
operating above the top-level LS. Each lower layer constitutes the performance element of
the next higher layer, and inter-layer communication is effected through the blackboards of
the various layers. The use of a blackboard in the single layer LS model was partly motivated
by its attractiveness in the multi-layer context.

Figure 3. Layering of Learning Systems. (Components are labelled as in Figure 2.)

This multi-layer architecture involves bidirectional information passing; that is, the
effects of adjustments made in a layer may propagate both to lower and higher level layers.
It is a hierarchical architecture in the general sense [Simon, 1969], and includes as a
specific case the bottom-to-top hierarchical architecture used, for example, by Soloway
[Soloway, 1977].

One existing LS which may be viewed as a layered system is the version of Samuel's
program [Samuel, 1967] that learns a polynomial evaluation function for selecting checkers
moves (see the Appendix for details). The lower layer (LS1) in this system adjusts the
coefficients of a given set of game board features in order to improve performance of the
move selection program. The second layer system (LS2) adjusts the set of board features
used in the evaluation function in order to improve the performance of LS1. Since LS1 is
contained in LS2 as the performance element, all the assumptions necessary for its operation
also belong to the LS2 world model. In addition, the LS2 world model contains assumptions
about the set of allowable game board features and the standard for evaluating LS1
performance.
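
The two-layer arrangement can be sketched in miniature. The code below is not Samuel's
procedure: the features a, b, and c, the training data, the least-mean-squares update
used by LS1, and the greedy feature search used by LS2 are all assumptions chosen to
keep the example small. What it preserves is the layering: LS1 adjusts coefficients for
a fixed feature set, and LS2 changes that feature set by criticizing LS1's performance.

```python
def ls1_fit(features, data, epochs=200, rate=0.05):
    """LS1: adjust the coefficients of a fixed feature set from scored positions."""
    weights = {f: 0.0 for f in features}
    for _ in range(epochs):
        for position, target in data:
            error = target - sum(weights[f] * position[f] for f in features)
            for f in features:
                weights[f] += rate * error * position[f]
    return weights


def ls1_error(features, weights, data):
    """Standard used by LS2's critic: summed squared error of LS1's evaluations."""
    return sum(
        (target - sum(weights[f] * position[f] for f in features)) ** 2
        for position, target in data
    )


def ls2_select(all_features, data, tolerance=1e-6):
    """LS2: grow LS1's feature set while doing so improves LS1's performance."""
    chosen, best = [], ls1_error([], {}, data)
    improved = True
    while improved:
        improved = False
        for f in all_features:
            if f in chosen:
                continue
            trial = chosen + [f]
            error = ls1_error(trial, ls1_fit(trial, data), data)
            if error < best - tolerance:
                best, best_feature, improved = error, f, True
        if improved:
            chosen.append(best_feature)
    return chosen, best


# Hypothetical positions; the target score is 2*a - b, so feature c is irrelevant.
data = [
    ({"a": 1, "b": 0, "c": 1}, 2),
    ({"a": 0, "b": 1, "c": 1}, -1),
    ({"a": 1, "b": 1, "c": 0}, 1),
    ({"a": 2, "b": 1, "c": 1}, 3),
    ({"a": 1, "b": 2, "c": 0}, 0),
]
chosen, error = ls2_select(["a", "b", "c"], data)
weights = ls1_fit(chosen, data)
```

LS2 never touches a coefficient directly; it alters only the vocabulary within which LS1
learns, which is the sense in which LS1 serves as the performance element of LS2.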




A single layer LS, then, can never move outside its world model to make radical
revisions to its way of viewing the task, that is, to achieve a paradigm shift as discussed by
Kuhn [Kuhn, 1970]. However, a shift in the conceptual framework of LS1 could be made by a
properly programmed LS2 [Buchanan, 1974]. We believe that a layered approach such as
that described above provides a useful system organization for learning at various levels of
abstraction in complex domains. Although there are examples of this kind of layering in the
literature [Samuel, 1963], [Uhr, 1963] and [Soloway, 1977], no one has carried it as far
as the model suggests. In fact, single layer learning systems are just now becoming well
enough understood to consider developing more sophisticated systems.


6.10  Implications  of  the  Model 

The  LS  model  described  here  provides  a common  language  for  characterization  and 
comparison  of  different  types  of  learning  systems  that  operate  in  a variety  of  task  domains. 
The  model  is  a useful  conceptual  guide  for  LS  design,  because  it  isolates  the  essential 
functional  components,  and  the  information  that  must  be  available  to  these  components. 

A number  of  desirable  features  for  future  learning  system  designs  are  brought  out  by 
this  model.  First,  the  design  should  be  modular,  with  individual  modules  corresponding  to  the 
functional  components  shown  in  the  model.  The  knowledge  used  by  the  system  should  be 
made  explicit  and  collected,  as  much  as  efficiency  considerations  permit,  in  a world  model 
component. In particular, the parts of the LS that are to be adjustable must be explicitly
exposed. Intelligent criticism is important, as is active instance selection, although neither
has been isolated as a separate object of study. Finally, a multi-layer architecture for
learning at different levels of abstraction is suggested by the model as a way of introducing
still more intelligence into the whole learning system.
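
The modular organization recommended above can be pictured with a small sketch. The Python below is only an illustration under our own assumptions: the class name `LearningSystem`, its field names, and the toy task (nudging a scalar guess toward a target) are hypothetical and are not taken from any system discussed in this article.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class LearningSystem:
    """One layer of the LS model. All names here are illustrative assumptions."""
    world_model: Dict[str, Any]                    # explicit, collected knowledge
    select_instance: Callable[[Dict], Any]         # instance selector
    perform: Callable[[Dict, Any], Any]            # performance element
    criticize: Callable[[Dict, Any, Any], Any]     # critic
    learn: Callable[[Dict, Any], None]             # learning element
    blackboard: Dict[str, Any] = field(default_factory=dict)

    def step(self) -> Any:
        """Run one select / perform / criticize / learn cycle."""
        instance = self.select_instance(self.world_model)
        response = self.perform(self.world_model, instance)
        critique = self.criticize(self.world_model, instance, response)
        # The blackboard holds the transient information shared by components.
        self.blackboard.update(instance=instance, response=response,
                               critique=critique)
        self.learn(self.world_model, critique)
        return critique


# Toy task (hypothetical): move an adjustable "guess" toward a fixed target.
toy = LearningSystem(
    world_model={"guess": 0.0, "target": 10.0, "rate": 0.5},
    select_instance=lambda wm: None,                       # no real environment
    perform=lambda wm, inst: wm["guess"],
    criticize=lambda wm, inst, resp: wm["target"] - resp,  # signed error
    learn=lambda wm, crit: wm.update(guess=wm["guess"] + wm["rate"] * crit),
)
for _ in range(20):
    toy.step()
```

Keeping each functional component behind its own callable is one way to make the adjustable parts explicit; a second layer could in principle treat a whole `LearningSystem` as its performance element.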


Appendix A

Characterization  of  Existing  Systems 


In this appendix several existing LSs are characterized using the framework provided
by the model described in Section 6. The systems selected are representative of several
approaches to machine learning. Because the blackboard contains information in a state of
flux, its contents are not specified explicitly for the systems characterized below.


Model Reference Adaptive Control, [Landau, 1974]


Purpose:  Construct  a controller  that  preprocesses  inputs  to  an  existing  system  (called 
the  plant).  The  behavior  of  the  combined  controller-plant  system  is  to  mimic  the  behavior  of  a 
third  system  (called  the  reference  model)  on  the  training  data. 

Environment: The plant to be controlled, and the set of possible inputs (including
disturbances).

Performance Element: The controller, a system whose output is used as input to the
plant. Its behavior is a function of the input signal, past I/O behavior of the plant, and a set
of adjustable parameters.

Instance  Selector:  Accepts  data  sequence  (as  input  to  the  controller)  from  the 
environment. 

Critic: Evaluation - applies a measure of performance that is some function of the
arithmetic difference between the plant and reference model outputs. In some cases the
reference model is mathematically defined, and can therefore be considered part of the
critic. In other cases the reference model is an actual system, and is considered part of the
environment.

Learning  Element:  Modifies  the  parameters  of  the  performance  element  (controller), 
depending  on  the  performance  measure  supplied  by  the  critic. 

World Model: Control theory assumptions (time invariance, linearity, etc.) and
techniques, and the standard of performance embodied in the critic.
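
As a rough illustration of how the critic and the learning element interact in this scheme, the sketch below adapts a single controller gain by gradient descent on the squared output error. It assumes a static plant with an unknown positive gain and a static reference model, which is far simpler than the systems treated in [Landau, 1974]; the function name `adapt_gain` and all parameter values are hypothetical.

```python
def adapt_gain(plant_gain, model_gain, rate=0.1, steps=100):
    """Adapt a controller gain theta so that plant(theta * r) tracks the model.

    Assumptions (ours, not the source's): the plant and the reference model
    are pure gains, plant_gain is positive, and rate * plant_gain < 2.
    """
    theta = 0.0                                  # adjustable parameter
    for t in range(steps):
        r = 1.0 if t % 2 == 0 else -1.0          # square-wave training input
        u = theta * r                            # performance element (controller)
        y = plant_gain * u                       # plant output
        y_ref = model_gain * r                   # reference model output
        error = y - y_ref                        # critic: arithmetic difference
        theta -= rate * error * r                # learning element: gradient step
    return theta


# The ideal gain is model_gain / plant_gain, here 1.0 / 2.0.
theta = adapt_gain(plant_gain=2.0, model_gain=1.0)
```

The update only needs the sign of the plant gain, not its value, which is why the unknown `plant_gain` does not appear in the learning step.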


Adaptive Pattern Classifier, [Koford, 1966]


Purpose:  Learn  the  parameters  of  a classifier  that  can  classify  a set  of  patterns  in  such 
a way  as  to  minimize  a specified  cost  functional. 

Environment: Patterns drawn from a pre-specified set of classes. Each pattern is
represented as a feature vector.

Performance  Element:  A linear  pattern  classifier  that  forms  the  inner  product  of  a 
pattern  feature  vector  (that  constitutes  the  input),  and  a weight  vector  (where  the  weights 
constitute  the  adjustable  parameters  of  the  classifier).  Based  on  the  resultant  scalar  value, 
the  classifier  assigns  the  pattern  to  a class. 

Instance  Selector:  Accepts  instances  from  a human  trainer.  The  classifier  uses  a set  of 
patterns  of  known  class  membership  to  tune  the  weights.  Thereafter,  the  weights  are  held 
constant. 

Critic: Evaluation - computes the difference between the output value of the classifier,
and the known acceptable output (the learning in this example is supervised).

Learning  Element:  Modifies  the  weights  used  by  the  classifier  according  to  the  LMS 
algorithm  [Widrow,  1960],  based  on  the  information  received  from  the  critic.  This  algorithm 
attempts  to  adjust  the  set  of  weights  so  as  to  minimize  the  mean-square  error  between  the 
output  of  the  classifier,  and  the  desired  output. 

World  Model:  Pattern  recognition  assumptions  concerning  the  suitability  of  representing 
the  patterns  as  feature  vectors,  the  suitability  of  a statistical  formulation  of  the 
classification  problem,  the  suitability  of  a linear  pattern  classifier,  the  suitability  of  the 
selected  performance  measure,  and  the  specific  adaptation  algorithm. 
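
The weight adjustment described above can be sketched in a few lines. The code below is a minimal rendering of the LMS (Widrow-Hoff) update for a two-class linear classifier; the training patterns, the learning rate, the added bias weight, and the function names `lms_train` and `classify` are assumptions made for illustration only.

```python
def lms_train(patterns, targets, rate=0.01, epochs=200):
    """LMS rule: w <- w + rate * (desired - w.x) * x, one pattern at a time."""
    weights = [0.0] * (len(patterns[0]) + 1)     # last weight acts as a bias
    for _ in range(epochs):
        for x, desired in zip(patterns, targets):
            x_aug = list(x) + [1.0]              # append constant bias input
            output = sum(w * xi for w, xi in zip(weights, x_aug))
            error = desired - output             # supplied by the critic
            weights = [w + rate * error * xi
                       for w, xi in zip(weights, x_aug)]
    return weights


def classify(weights, x):
    """Assign class +1 or -1 from the sign of the inner product."""
    value = sum(w * xi for w, xi in zip(weights, list(x) + [1.0]))
    return 1 if value >= 0 else -1


# Hypothetical training set: two linearly separable classes in the plane.
patterns = [(2, 2), (3, 1), (2, 3), (-2, -2), (-3, -1), (-2, -3)]
targets = [1, 1, 1, -1, -1, -1]
weights = lms_train(patterns, targets)
predicted = [classify(weights, x) for x in patterns]
```

After training, the weights are held constant and `classify` plays the role of the performance element.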


Checker Player, [Samuel, 1963], [Samuel, 1967]


Purpose: Learn to play a good game of checkers (here we discuss only the version of the
program that learns a linear polynomial evaluation function by examination of moves
suggested by experts, i.e., book moves).

Environment:  Set  of  all  legal  game  boards. 


LS1 (lower layer):


Purpose: Learn a good set of coefficients for combining board features in a linear
polynomial evaluation function.

Performance  Element:  Uses  the  learned  evaluation  function  to  rank  plausible  moves  for 
a given  board  position. 

Instance Selector: Reads instances from a list of pre-defined
game-board/recommended-move pairs.

Critic: Evaluation - examines the ranking given to the book move by the performance
element. Localization - suggests that the book move should be ranked above all other moves.

Learning  Element:  Adjusts  weights  of  linear  polynomial  to  make  move  selection 
correspond  to  the  critic's  recommendation. 

World  Model:  Syntax  of  game  board,  form  and  features  of  linear  polynomial  evaluation 
function,  method  for  adjusting  evaluation  function,  and  rules  of  checkers. 
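
One way to picture the interaction between this critic and learning element is a simple ranking update: whenever an alternative move scores at least as high as the book move, the weights are shifted toward the book move's features. This is a perceptron-style stand-in chosen for brevity, not Samuel's actual coefficient-adjustment procedure; the three features and all numbers are hypothetical.

```python
def score(weights, features):
    """Linear polynomial evaluation of one candidate move."""
    return sum(w * f for w, f in zip(weights, features))


def train_on_book_moves(instances, n_features, rate=0.1, epochs=50):
    """instances: list of (book_move_features, [other_move_features, ...])."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for book, others in instances:
            for alt in others:
                # Critic: the book move should be ranked above this move.
                if score(weights, alt) >= score(weights, book):
                    # Learning element: shift weights toward the book move.
                    weights = [w + rate * (b - a)
                               for w, b, a in zip(weights, book, alt)]
    return weights


# Hypothetical feature vectors: (piece advantage, mobility, center control).
instances = [
    ((1, 0, 1), [(0, 1, 0), (1, 1, 0)]),
    ((0, 1, 1), [(1, 0, 0), (0, 1, 0)]),
]
weights = train_on_book_moves(instances, n_features=3)
```

With weights learned this way, the performance element would rank moves by `score` and prefer the highest.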


LS2 (upper layer):


Purpose: Improve the performance of LS1 by selection of a good set of board features.

Performance Element: LS1.

Instance Selector: The entire set of possible training instances is simply passed to LS1
(via the blackboard).

Critic: Evaluation - analyses the learning ability of LS1 (i.e., the LS2 performance
element) with the current set of evaluation function features. Localization - singles out
features that are not useful. Recommendation - selects new features from a predefined list
to replace useless features.

Learning  Element:  Redefines  the  current  set  of  features  as  recommended  by  the  critic. 

World Model: The LS1 world model, plus the set of features that may be considered, and
the performance standard employed by the LS2 critic.
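
The upper layer's replace-the-useless-feature cycle might be sketched as follows. The usefulness test used here (a small absolute weight), the threshold, and the feature names are assumptions made for illustration; the source does not specify how the LS2 critic judges usefulness.

```python
def revise_features(weights, reserve, threshold=0.05):
    """Swap features judged useless for new ones from a predefined list.

    weights: dict mapping active feature name -> learned coefficient.
    reserve: ordered list of feature names not currently in use.
    Returns the revised feature dict and the remaining reserve list.
    """
    active = dict(weights)
    pool = list(reserve)
    for name, coefficient in weights.items():
        # Critic (localization): assume a near-zero coefficient means "not useful".
        if abs(coefficient) < threshold and pool:
            del active[name]
            # Recommendation: take the next candidate; LS1 must relearn its weight.
            active[pool.pop(0)] = 0.0
    return active, pool


# Hypothetical coefficients learned by the lower layer.
learned = {"piece_advantage": 0.8, "mobility": 0.01, "center_control": 0.3}
active, pool = revise_features(learned, ["back_row_control", "exchange"])
```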


Poker Player, [Waterman, 1970]


Purpose:  Learn  a good  strategy  for  making  bets  in  draw  poker. 

Environment:  Set  of  all  legal  poker  game  states. 

Performance  Element:  Applies  the  learned  production  rules  to  generate  actions  in  a 
poker  game,  e.g.,  bets. 

Instance  Selector:  Selects  each  game  state  derived  by  play  against  an  opponent  as  a 
training  instance. 

Critic: Two versions of the program use two different critics. In both cases the critic
performs the following functions: Evaluation - decides whether the poker bet made by the
Performance Element was acceptable. Localization - gives important state variables for
deciding the correct bet. Recommendation - provides the bet which the Performance Element
should have made. In explicit learning the critic is an expert poker player, either human or
programmed. In implicit learning, the evaluation and recommendation are deduced from the
next action of the opponent and a set of predefined axioms, while localization is read from a
predefined decision matrix.

Learning  Element:  Modifies  and  adds  production  rules  to  the  system.  Mistakes  are 
corrected  by  adding  a new  rule  in  front  of  the  rule  responsible  for  the  incorrect  response. 

World  Model:  Rules  of  poker,  features  used  to  describe  the  game  state,  the  language  of 
production  rules,  heuristics  for  updating  the  rule  base,  the  model  of  an  opponent. 
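
The rule-ordering idea in the learning element can be illustrated with a short sketch: rules are tried in order, the first one whose condition holds fires, and a mistake is repaired by placing a new rule ahead of the one that fired. The state variable `hand_value`, the actions, and the function names are hypothetical, and the new rule's condition is simply handed in where the critic's localization would normally supply it.

```python
def first_match(rules, state):
    """Return (index, action) of the first rule whose condition holds."""
    for index, (condition, action) in enumerate(rules):
        if condition(state):
            return index, action
    return None, None


def correct_rules(rules, state, right_action, new_condition):
    """If the fired rule gave the wrong action, insert a new rule before it."""
    index, action = first_match(rules, state)
    if action != right_action:
        position = len(rules) if index is None else index
        rules.insert(position, (new_condition, right_action))


# Hypothetical starting rule base: always call.
rules = [(lambda s: True, "call")]
strong_hand = {"hand_value": 90}
weak_hand = {"hand_value": 20}

# The critic recommends "raise" for the strong hand and points at hand_value.
correct_rules(rules, strong_hand, "raise", lambda s: s["hand_value"] >= 80)
```

Because the new rule sits in front of the old one, it overrides the mistaken response without disturbing behavior on states it does not match.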


Meta-DENDRAL,  [Buchanan,  1978] 


Purpose: Learn to predict data points in the mass spectra of molecules.

Environment:  Set  of  all  known  molecule/data-point  pairs. 

Performance Element: Predicts peaks (data points) in mass spectra of molecules using
learned production rules. Employs a model of mass spectrometry for translating between
mass-spectral processes (predicted by the rules) and data points in the spectrum.

Instance  Selector:  Accepts  a set  of  known  molecule/spectrum  pairs  from  the  user. 

Critic: Evaluation - determines the suitability of the set of predictions generated by a
rule. Localization - states whether the rule is acceptable, too specific, or too general.
Recommendation - recommends adding or deleting features to the left-hand sides of rules.

Learning Element: Conducts a heuristic search through the space of plausible rules
using a predefined rule generator. At each step in the search the potential rule's
performance is reviewed by the critic.

World Model: Representation of molecules as graphs, production rule model of mass
spectrometry, vocabulary of rules used to represent learned information, heuristics used by
the critic in directing the rule search.
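
The critic-guided search can be caricatured as a general-to-specific search over sets of features. The sketch below is a heavily simplified stand-in, not the Meta-DENDRAL rule generator: rules are plain feature sets, examples are labeled feature sets, and the critic's thresholds and the feature names are hypothetical.

```python
def critic(rule, examples, min_positive=2):
    """Judge a rule by the labeled examples whose features it matches."""
    positive = sum(1 for feats, label in examples if label and rule <= feats)
    negative = sum(1 for feats, label in examples if not label and rule <= feats)
    if positive < min_positive:
        return "too specific"
    if negative > 0:
        return "too general"
    return "acceptable"


def search_rules(examples, vocabulary):
    """Breadth-first specialization, starting from the most general rule."""
    frontier = [frozenset()]
    accepted, seen = [], set()
    while frontier:
        rule = frontier.pop(0)
        verdict = critic(rule, examples)
        if verdict == "acceptable":
            accepted.append(rule)
        elif verdict == "too general":
            # Recommendation: add one feature to the left-hand side.
            for feature in vocabulary:
                child = rule | {feature}
                if feature not in rule and child not in seen:
                    seen.add(child)
                    frontier.append(child)
        # "too specific" rules are pruned: adding features cannot help them.
    return accepted


# Hypothetical substructure features and labeled observations.
vocabulary = ["N", "C=O", "ring"]
examples = [
    ({"N", "ring"}, True),
    ({"N", "C=O"}, True),
    ({"C=O", "ring"}, False),
    ({"ring"}, False),
]
found = search_rules(examples, vocabulary)
```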


Learning Structural Descriptions from Examples, [Winston, 1970], [Winston, 1976]


Purpose:  Learn  to  identify  blocks  world  structures  (such  as  arches  and  towers). 

Environment:  Set  of  possible  line  drawing/structure-classification  pairs. 

Performance  Element:  Decides  class  of  structures  to  which  the  input  structure  belongs. 
Uses  a model  of  the  structure  class  supplied  by  the  learning  element. 

Instance Selector: Accepts training instances supplied individually by the user.

Critic: Evaluation - compares the classification made by the Performance Element
against the correct classification as supplied with each training instance. Localization -
generates a comparison description pointing out differences between the model and the
structure description.

Learning  Element:  Constructs  a model  of  the  class  of  structures  under  consideration. 
Examines  the  comparison  description  supplied  by  the  critic,  and  modifies  the  model  to 
strengthen  or  weaken  the  correspondence  between  the  model  and  the  training  instance. 

World Model: Representation of scenes as line drawings, method of translating line
drawings to graphical descriptions, grammar for representing the learned information,
domain-specific heuristics for resolving among possible changes to each structure class model.